Author: Dawid Pludowski
import pandas as pd
import pickle as pkl
import dalex as dx
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
Most of the data preprocessing was done for previous homeworks; only Python-side preprocessing, such as encoding the categorical variable, is required here.
df = pd.read_csv('../data_scaled.csv')
df.head()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_household | bedrooms_per_room | population_per_household | ocean_proximity | median_house_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.327835 | 1.052548 | 0.982143 | -0.804819 | -0.972476 | -0.974429 | -0.977033 | 2.344766 | 0.628559 | -1.149930 | -0.049597 | NEAR BAY | 452600.0 |
| 1 | -1.322844 | 1.043185 | -0.607019 | 2.045890 | 1.357143 | 0.861439 | 1.669961 | 2.332238 | 0.327041 | -0.990381 | -0.092512 | NEAR BAY | 358500.0 |
| 2 | -1.332827 | 1.038503 | 1.856182 | -0.535746 | -0.827024 | -0.820777 | -0.843637 | 1.782699 | 1.155620 | -1.445865 | -0.025843 | NEAR BAY | 352100.0 |
| 3 | -1.337818 | 1.038503 | 1.856182 | -0.624215 | -0.719723 | -0.766028 | -0.733781 | 0.932968 | 0.156966 | -0.493627 | -0.050329 | NEAR BAY | 341300.0 |
| 4 | -1.337818 | 1.038503 | 1.856182 | -0.462404 | -0.612423 | -0.759847 | -0.629157 | -0.012881 | 0.344711 | -0.707889 | -0.085616 | NEAR BAY | 342200.0 |
mapping = {
'NEAR BAY': 0,
'ISLAND': 1,
'NEAR OCEAN': 2,
'<1H OCEAN': 3,
'INLAND': 4
}
df['ocean_proximity'] = df['ocean_proximity'].map(mapping)
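One caveat worth keeping in mind: `pd.Series.map` silently produces `NaN` for any category missing from the mapping dict, so a quick sanity check after encoding is cheap insurance. A minimal sketch on toy data (not the housing set):

```python
import pandas as pd

# Toy series with a deliberately misspelled category to show that
# .map returns NaN for values absent from the mapping dict.
mapping = {'NEAR BAY': 0, 'ISLAND': 1, 'NEAR OCEAN': 2, '<1H OCEAN': 3, 'INLAND': 4}
s = pd.Series(['NEAR BAY', 'INLAND', 'NEAR OCEANN'])  # last value has a typo
encoded = s.map(mapping)
print(encoded.isna().sum())  # -> 1, the unmapped value became NaN
```

On the real data, `df['ocean_proximity'].isna().sum()` should be 0 after mapping if all five categories were spelled correctly.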
X = df.drop(columns=['median_house_value'])
y = df[['median_house_value']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV
dt = DecisionTreeRegressor(max_depth=10)
dt_tuned = RandomizedSearchCV(
dt,
{
'criterion': ['squared_error', 'absolute_error'],
'max_depth': [i for i in range(5, 25, 2)],
'min_samples_split': [i for i in range(2, 10)],
'min_samples_leaf': [i for i in range(1, 5)]
},
n_iter=15,
random_state=2137
)
dt_tuned.fit(X_train, y_train)
print(dt_tuned.best_estimator_)
# with open('decision_tree.pkl', 'rb') as file:
# dt_tuned = pkl.load(file)
rf = RandomForestRegressor()
rf_tuned = RandomizedSearchCV(
rf,  # note: pass the random forest here, not the decision tree
{
'criterion': ['squared_error', 'absolute_error'],
'max_features': ['sqrt', 'log2'],
'min_samples_split': [i for i in range(2, 10)],
'min_samples_leaf': [i for i in range(1, 5)],
'max_depth': [i for i in range(3, 10, 2)]
},
n_iter=15,
random_state=2137
)
rf_tuned.fit(X_train, y_train)
print(rf_tuned.best_estimator_)
# with open('random_forest.pkl', 'rb') as file:
# rf_tuned = pkl.load(file)
mlp = MLPRegressor(
random_state=2137
)
mlp_tuned = RandomizedSearchCV(
mlp,
{
'hidden_layer_sizes': [
(10, 100, 20),
(5, 50, 50, 10),
(25, 100, 20)
]
},
n_iter=3
)
mlp_tuned.fit(X_train, y_train)
print(mlp_tuned.best_estimator_)
# with open('mlp.pkl', 'rb') as file:
# mlp_tuned = pkl.load(file)
print(f'decision tree score: {dt_tuned.score(X_test, y_test)}')
print(f'random forest score: {rf_tuned.score(X_test, y_test)}')
print(f'neural network score: {mlp_tuned.score(X_test, y_test)}')
decision tree score: 0.7122037298977866
random forest score: 0.6655753723827369
neural network score: 0.7030360456509592
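For context, `score` on a scikit-learn regressor returns the coefficient of determination R², which is why the values above fall below 1. A small sketch on synthetic data (not the housing set) confirming the equivalence with `r2_score`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic regression data: target driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
# .score is the coefficient of determination R^2 for regressors.
assert np.isclose(model.score(X, y), r2_score(y, model.predict(X)))
print('score equals r2_score')
```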
The decision tree achieved the best score, so we will treat it as the base model in the remainder of the notebook.
observation = df.iloc[[2137]]
observation
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_household | bedrooms_per_room | population_per_household | ocean_proximity | median_house_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2137 | -0.075017 | 0.551589 | -1.083767 | -0.211208 | 0.064765 | -0.204406 | -0.045877 | -0.62848 | -0.370457 | 0.803675 | -0.057143 | 4 | 87500.0 |
prediction = dt_tuned.predict(observation.drop(columns=['median_house_value']))
true_value = observation['median_house_value'].iloc[0]
print(f'true value: {int(true_value)}')
print(f'predicted value: {prediction[0]}')
true value: 87500 predicted value: 79850.0
The predicted value is close to the true one.
dt_explainer = dx.Explainer(
dt_tuned,
X,
y,
label='decision tree'
)
Preparation of a new explainer is initiated
  -> data              : 20640 rows 12 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 20640 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : decision tree
  -> predict function  : <function yhat_default at 0x000001F58FBF98B0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 3.62e+04, mean = 2.02e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -3.88e+05, mean = 4.6e+03, max = 4.31e+05
  -> model_info        : package sklearn
A new explainer has been created!
dt_observation_exp = dt_explainer.predict_profile(observation.drop(columns=['median_house_value']))
Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 12/12 [00:00<00:00, 398.05it/s]
dt_observation_exp.plot(
variables=['median_income', 'ocean_proximity', 'households', 'housing_median_age']
)
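The idea behind a ceteris-paribus profile can be sketched without dalex: copy the observation along a grid of values for a single feature, hold every other feature fixed, and evaluate the model at each grid point. A minimal illustration on synthetic data (not the housing set):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends mainly on feature 'a'.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['a', 'b', 'c'])
y = 3 * X['a'] + X['b']
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

obs = X.iloc[[0]]
grid = np.linspace(X['a'].min(), X['a'].max(), 50)

# Repeat the observation along the grid, varying only 'a'.
profile = obs.loc[obs.index.repeat(len(grid))].reset_index(drop=True)
profile['a'] = grid
cp = model.predict(profile)  # the CP profile for feature 'a'
print(len(cp))  # -> 50 predictions along the grid
```

Plotting `cp` against `grid` gives the same kind of curve that `predict_profile(...).plot(...)` draws for each variable.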
Despite achieving the highest score, the decision tree bases its decisions on only 2-3 of the 12 variables (the remaining plots are omitted for the sake of notebook clarity). This may suggest that this kind of model cannot exploit all the information hidden in the data, and that other models should therefore be considered.
The greatest impact on the prediction comes from median_income, and it follows the rule that the richer the inhabitants are, the more expensive the neighbourhood is, which is reasonable.
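The claim that the tree relies on only a few variables can also be checked directly via impurity-based feature importances. A hypothetical sketch on synthetic data where one feature drives the target, showing how a shallow tree concentrates its importance there and leaves the rest near zero (consistent with flat CP profiles for unused variables):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: feature 0 carries nearly all the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 4 * X[:, 0] + 0.1 * rng.normal(size=500)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
importances = tree.feature_importances_
print(importances.argmax())  # -> 0, the dominant feature
```

On the housing model, `dt_tuned.best_estimator_.feature_importances_` would show the same concentration on median_income.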
rf_explainer = dx.Explainer(
rf_tuned,
X,
y,
label='random forest'
)
mlp_explainer = dx.Explainer(
mlp_tuned,
X,
y,
label='neural network'
)
Preparation of a new explainer is initiated
  -> data              : 20640 rows 12 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 20640 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : random forest
  -> predict function  : <function yhat_default at 0x000001F58FBF98B0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 5.24e+04, mean = 1.99e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -3.58e+05, mean = 8.16e+03, max = 3.96e+05
  -> model_info        : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data              : 20640 rows 12 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 20640 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : neural network
  -> predict function  : <function yhat_default at 0x000001F58FBF98B0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 3.14e+04, mean = 2.08e+05, max = 7.78e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -6.11e+05, mean = -1.2e+03, max = 4.2e+05
  -> model_info        : package sklearn
A new explainer has been created!
rf_observation_exp = rf_explainer.predict_profile(observation.drop(columns=['median_house_value']))
mlp_observation_exp = mlp_explainer.predict_profile(observation.drop(columns=['median_house_value']))
Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 12/12 [00:00<00:00, 428.69it/s]
Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 12/12 [00:00<00:00, 387.16it/s]
dt_observation_exp.plot((rf_observation_exp, mlp_observation_exp), variables=['median_income', 'ocean_proximity', 'households', 'housing_median_age'])
The model comparison for this particular observation shows that the neural network may be more sensitive to changes in the households variable. Moreover, the full CP plots (not shown in the notebook) suggest that a change in any variable affects the network's decision, which is not true for the random forest or the decision tree. Further analysis should be performed to check whether the changes in the neural network's CP profiles are reasonable; if so, the neural network should be considered the best model for estimating the median price, as its score is only slightly lower than the decision tree's and its predictions are more nuanced.
CP plots show that the best model (in terms of score) may not take all the available information into account and, as a result, may be a poor explainer of the real world. However, one should remember that the dataset does contain interactions (such as longitude and latitude) and correlated variables (the ratio features), and because of that, CP plots should not be treated as a definitive method for explaining model performance.